Aidan Henbest
Dr. Bixler
Computer Science
15 June 2022
$\;\;\;\;\;\;$The data set that was chosen is from the report "How to Analyze Your Gender Pay Gap: An Employer’s Guide" on the website www.glassdoor.com. Specifically, the link to the data set comes from this page: https://www.glassdoor.com/research/how-to-analyze-gender-pay-gap-employers-guide/. This data set was chosen because of its relevance in today's world, in which pay equality is greatly disputed. The pay inequality that is often discussed, the gender pay gap, can be analyzed using this data set. However, this data set can be analyzed in many other ways too. It has extensive data on one thousand employees, including their job title, gender, age, performance evaluation score, education level, department, seniority level, yearly base pay, and bonus pay. All of this data can be utilized in many different ways to answer many different interesting questions regarding employee statistics. The questions that are answered in this analysis include these six:
While this data set could certainly be explored in more depth, only these six questions could be analyzed in the time allotted. Their answers are presented later in this analysis.
# Import statements
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from pandas.api.types import CategoricalDtype
from matplotlib import cm
# Creates a method that allows values to be shown on a bar graph
def show_values(axs, orient='v', style='{:,.2f}'):
    def single(ax):
        if orient == 'v':
            for p in ax.patches:
                x = p.get_x() + p.get_width() / 2
                y = p.get_y() + p.get_height() / 2
                value = style.format(p.get_height())
                ax.text(x, y, value, ha='center', rotation=90)
        elif orient == 'h':
            for p in ax.patches:
                x = p.get_x() + p.get_width() / 2
                y = p.get_y() + p.get_height() - (p.get_height()*0.5)
                value = style.format(p.get_width())
                ax.text(x, y, value, ha='left')
    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            single(ax)
    else:
        single(axs)
# Set Style
sns.set()
# Creates the main data frame
pay = pd.read_csv('./data/Glassdoor.csv', sep=',')
pay.rename(columns={'jobTitle': 'Job Title', 'gender': 'Gender', 'age': 'Age',
'perfEval': 'Performance Evaluation', 'edu': 'Education',
'dept': 'Department', 'seniority': 'Seniority', 'basePay': 'Base Pay',
'bonus': 'Bonus'}, inplace=True)
pay['Percent Bonus'] = 100 * (pay['Bonus'] / pay['Base Pay'])
pay['Total Pay'] = pay['Bonus'] + pay['Base Pay']
# Creates a data frame in which the education values are assigned to a custom data type, so they are ordered correctly
pay_my_type = pd.read_csv('./data/Glassdoor.csv', sep=',')
education_order = CategoricalDtype(['High School', 'College', 'Masters', 'PhD'], ordered=True)
pay_my_type.rename(columns={'jobTitle': 'Job Title', 'gender': 'Gender', 'age': 'Age',
'perfEval': 'Performance Evaluation', 'edu': 'Education',
'dept': 'Department', 'seniority': 'Seniority', 'basePay': 'Base Pay',
'bonus': 'Bonus'}, inplace=True)
pay_my_type['Education'] = pay_my_type['Education'].astype(education_order)
pay_my_type['Percent Bonus'] = 100 * (pay_my_type['Bonus'] / pay_my_type['Base Pay'])
pay_my_type['Total Pay'] = pay_my_type['Bonus'] + pay_my_type['Base Pay']
# Creates a data frame in which all of the categorical data is converted to numerical data
pay_num = pd.read_csv('./data/Glassdoor.csv', sep=',')
pay_num.rename(columns={'jobTitle': 'Job Title', 'gender': 'Gender', 'age': 'Age',
'perfEval': 'Performance Evaluation', 'edu': 'Education',
'dept': 'Department', 'seniority': 'Seniority', 'basePay': 'Base Pay',
'bonus': 'Bonus'}, inplace=True)
pay_num['Job Title'].replace(['Graphic Designer', 'Software Engineer', 'Warehouse Associate', 'IT',
'Sales Associate', 'Driver', 'Financial Analyst', 'Marketing Associate',
'Data Scientist', 'Manager'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], inplace=True)
pay_num['Gender'].replace(['Male', 'Female'], [0, 1], inplace=True)
pay_num['Education'].replace(['High School', 'College', 'Masters', 'PhD'], [0, 1, 2, 3], inplace=True)
pay_num['Department'].replace(['Operations', 'Sales', 'Management', 'Administration', 'Engineering'],
[0, 1, 2, 3, 4], inplace=True)
pay_num['Percent Bonus'] = 100 * (pay_num['Bonus'] / pay_num['Base Pay'])
pay_num['Total Pay'] = pay_num['Bonus'] + pay_num['Base Pay']
# Shows the name of each column, number of entries in each column, and the data type of each column
pay.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Job Title               1000 non-null   object
 1   Gender                  1000 non-null   object
 2   Age                     1000 non-null   int64
 3   Performance Evaluation  1000 non-null   int64
 4   Education               1000 non-null   object
 5   Department              1000 non-null   object
 6   Seniority               1000 non-null   int64
 7   Base Pay                1000 non-null   int64
 8   Bonus                   1000 non-null   int64
 9   Percent Bonus           1000 non-null   float64
 10  Total Pay               1000 non-null   int64
dtypes: float64(1), int64(6), object(4)
memory usage: 86.1+ KB
$\;\;\;\;\;\;$This function shows the title of each column in the main data frame: job title, gender, age, performance evaluation, education, department, seniority, base pay, bonus, percent bonus, and total pay. It also shows that each column has one thousand pieces of data in it. Lastly, this function shows the data type for each column. Job title is an object, gender is an object, age is an int64, performance evaluation is an int64, education is an object, department is an object, seniority is an int64, base pay is an int64, bonus is an int64, percent bonus is a float64, and total pay is an int64.
# Shows the memory used by each column in the main data frame
pay.memory_usage(deep=True)
Index                       128
Job Title                 70468
Gender                    61936
Age                        8000
Performance Evaluation     8000
Education                 64108
Department                66929
Seniority                  8000
Base Pay                   8000
Bonus                      8000
Percent Bonus              8000
Total Pay                  8000
dtype: int64
$\;\;\;\;\;\;$This function shows the amount of memory being taken up by each column in the main data frame, in bytes. From this we can see that job title takes up 70,468 bytes, gender takes up 61,936 bytes, age takes up 8,000 bytes, performance evaluation takes up 8,000 bytes, education takes up 64,108 bytes, department takes up 66,929 bytes, seniority takes up 8,000 bytes, base pay takes up 8,000 bytes, bonus takes up 8,000 bytes, percent bonus takes up 8,000 bytes, and total pay takes up 8,000 bytes.
# Shows the memory used by each column in the pay_num data frame
pay_num.memory_usage(deep=True)
Index                      128
Job Title                 8000
Gender                    8000
Age                       8000
Performance Evaluation    8000
Education                 8000
Department                8000
Seniority                 8000
Base Pay                  8000
Bonus                     8000
Percent Bonus             8000
Total Pay                 8000
dtype: int64
$\;\;\;\;\;\;$This function shows the amount of memory being taken up by each column in the pay_num data frame, in bytes. This data frame has all of the data converted to numbers, which allows for some easier analysis. From this we can see that every column (job title, gender, age, performance evaluation, education, department, seniority, base pay, bonus, percent bonus, and total pay) takes up 8,000 bytes, which is eight bytes for each of the one thousand values in a 64-bit numeric column.
# Shows the memory used by each column in the pay_my_type data frame
pay_my_type.memory_usage(deep=True)
Index                       128
Job Title                 70468
Gender                    61936
Age                        8000
Performance Evaluation     8000
Education                  1428
Department                66929
Seniority                  8000
Base Pay                   8000
Bonus                      8000
Percent Bonus              8000
Total Pay                  8000
dtype: int64
$\;\;\;\;\;\;$This function shows the amount of memory being taken up by each column in the pay_my_type data frame, in bytes. This data frame has the education column converted to a custom data type, which allows for the levels of education to be ordered correctly. From this we can see that job title takes up 70,468 bytes, gender takes up 61,936 bytes, age takes up 8,000 bytes, performance evaluation takes up 8,000 bytes, education takes up 1,428 bytes, department takes up 66,929 bytes, seniority takes up 8,000 bytes, base pay takes up 8,000 bytes, bonus takes up 8,000 bytes, percent bonus takes up 8,000 bytes, and total pay takes up 8,000 bytes. The only column that differs from the main data frame is education, which drops from 64,108 bytes to 1,428 bytes after the conversion.
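$\;\;\;\;\;\;$The effect of the custom data type can be illustrated with a minimal sketch. The series below is synthetic (it is an assumption made for illustration, not the Glassdoor data), so the byte counts will not match the output above; only the direction of the difference and the sort order are the point.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Synthetic stand-in for the Education column (not the Glassdoor data)
levels = ['High School', 'College', 'Masters', 'PhD']
education = pd.Series(levels * 250)  # 1,000 string values

# Same kind of ordered dtype as the one used for pay_my_type
education_order = CategoricalDtype(levels, ordered=True)
education_cat = education.astype(education_order)

# A categorical column stores each label once plus a small integer code per row
object_bytes = education.memory_usage(deep=True, index=False)
category_bytes = education_cat.memory_usage(deep=True, index=False)
print(object_bytes, category_bytes)

# Sorting an ordered categorical follows the declared order, not the alphabet
print(education_cat.sort_values().unique().tolist())
print(sorted(education.unique()))
```

Because the categories are declared as ordered, comparisons and functions such as `min()` and `max()` also follow the education order rather than alphabetical order.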
# Shows the number of values missing in each column
pay.isna().sum()
Job Title                 0
Gender                    0
Age                       0
Performance Evaluation    0
Education                 0
Department                0
Seniority                 0
Base Pay                  0
Bonus                     0
Percent Bonus             0
Total Pay                 0
dtype: int64
$\;\;\;\;\;\;$This function shows the number of values missing from each column of the main data frame. Since the other data frames only convert values from the main one to a different data type, this information applies to all of the data frames. From this we can see that none of the eleven columns (job title, gender, age, performance evaluation, education, department, seniority, base pay, bonus, percent bonus, and total pay) is missing any values.
# Shows the sum of all of the values in each column in the pay_num data frame
pay_num.cumsum().tail(1).style.format({'Job Title': '{:,.2f}', 'Gender': '{:,.2f}', 'Age': '{:,.2f}',
'Performance Evaluation': '{:,.2f}', 'Education': '{:,.2f}',
'Department': '{:,.2f}', 'Seniority': '{:,.2f}', 'Base Pay': '${:,.0f}',
'Bonus': '${:,.0f}', 'Total Pay': '${:,.0f}', 'Percent Bonus': '{:,.2f}%'})
| | Job Title | Gender | Age | Performance Evaluation | Education | Department | Seniority | Base Pay | Bonus | Percent Bonus | Total Pay |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 999 | 4,542.00 | 468.00 | 41,393.00 | 3,037.00 | 1,467.00 | 1,950.00 | 2,971.00 | $94,472,653 | $6,467,161 | 7,515.73% | $100,939,814 |
$\;\;\;\;\;\;$This function shows the last row of the cumulative sums of the pay_num data frame, which is the total of all of the values in each column. From this we can see that job title sums to 4,542; gender sums to 468; age sums to 41,393; performance evaluation sums to 3,037; education sums to 1,467; department sums to 1,950; seniority sums to 2,971; base pay sums to \$94,472,653; bonus sums to \$6,467,161; percent bonus sums to 7,515.73%; and total pay sums to \$100,939,814. The sums of the coded job title and department columns have no direct interpretation, but the gender sum of 468 equals the number of female employees, since females are coded as 1.
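$\;\;\;\;\;\;$As a small sketch of why the last row of a cumulative sum is the column total, the example below compares `cumsum().tail(1)` with `sum()` on a three-row frame. The frame is illustrative only; its column names mirror the analysis, but it is not the full data set.

```python
import pandas as pd

# Three illustrative rows; column names mirror the analysis
df = pd.DataFrame({'Base Pay': [42363, 108476, 90208],
                   'Bonus': [9938, 11128, 9268]})

# Running totals; the last row holds the total of each column
last_cumsum = df.cumsum().tail(1)

# The same totals computed directly
totals = df.sum()

print(last_cumsum)
print(totals)
```

If only the totals are needed, `sum()` is the more direct call, since `cumsum()` computes every intermediate running total before the last row is selected.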
# Shows the first 10 rows of the main data frame
pay.head(10).style.format({'Base Pay': '${:,.0f}', 'Bonus': '${:,.0f}',
'Total Pay': '${:,.0f}', 'Percent Bonus': '{:,.2f}%'})
| | Job Title | Gender | Age | Performance Evaluation | Education | Department | Seniority | Base Pay | Bonus | Percent Bonus | Total Pay |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Graphic Designer | Female | 18 | 5 | College | Operations | 2 | $42,363 | $9,938 | 23.46% | $52,301 |
| 1 | Software Engineer | Male | 21 | 5 | College | Management | 5 | $108,476 | $11,128 | 10.26% | $119,604 |
| 2 | Warehouse Associate | Female | 19 | 4 | PhD | Administration | 5 | $90,208 | $9,268 | 10.27% | $99,476 |
| 3 | Software Engineer | Male | 20 | 5 | Masters | Sales | 4 | $108,080 | $10,154 | 9.39% | $118,234 |
| 4 | Graphic Designer | Male | 26 | 5 | Masters | Engineering | 5 | $99,464 | $9,319 | 9.37% | $108,783 |
| 5 | IT | Female | 20 | 5 | PhD | Operations | 4 | $70,890 | $10,126 | 14.28% | $81,016 |
| 6 | Graphic Designer | Female | 20 | 5 | College | Sales | 4 | $67,585 | $10,541 | 15.60% | $78,126 |
| 7 | Software Engineer | Male | 18 | 4 | PhD | Engineering | 5 | $97,523 | $10,240 | 10.50% | $107,763 |
| 8 | Graphic Designer | Female | 33 | 5 | High School | Engineering | 5 | $112,976 | $9,836 | 8.71% | $122,812 |
| 9 | Sales Associate | Female | 35 | 5 | College | Engineering | 5 | $106,524 | $9,941 | 9.33% | $116,465 |
$\;\;\;\;\;\;$This function shows the first 10 rows of the main data frame, including all of the values from each column in the data frame.
# Shows the last 10 rows of the main data frame
pay.tail(10).style.format({'Base Pay': '${:,.0f}', 'Bonus': '${:,.0f}',
'Total Pay': '${:,.0f}', 'Percent Bonus': '{:,.2f}%'})
| | Job Title | Gender | Age | Performance Evaluation | Education | Department | Seniority | Base Pay | Bonus | Percent Bonus | Total Pay |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 990 | Graphic Designer | Female | 61 | 1 | Masters | Engineering | 1 | $91,030 | $3,318 | 3.64% | $94,348 |
| 991 | IT | Female | 65 | 1 | Masters | Administration | 1 | $106,945 | $2,041 | 1.91% | $108,986 |
| 992 | Graphic Designer | Female | 63 | 1 | College | Administration | 2 | $81,545 | $3,418 | 4.19% | $84,963 |
| 993 | Marketing Associate | Female | 65 | 1 | Masters | Administration | 1 | $80,789 | $1,884 | 2.33% | $82,673 |
| 994 | Marketing Associate | Female | 64 | 1 | PhD | Administration | 2 | $85,253 | $2,777 | 3.26% | $88,030 |
| 995 | Marketing Associate | Female | 61 | 1 | High School | Administration | 1 | $62,644 | $3,270 | 5.22% | $65,914 |
| 996 | Data Scientist | Male | 57 | 1 | Masters | Sales | 2 | $108,977 | $3,567 | 3.27% | $112,544 |
| 997 | Financial Analyst | Male | 48 | 1 | High School | Operations | 1 | $92,347 | $2,724 | 2.95% | $95,071 |
| 998 | Financial Analyst | Male | 65 | 2 | High School | Administration | 1 | $97,376 | $2,225 | 2.28% | $99,601 |
| 999 | Financial Analyst | Male | 60 | 1 | PhD | Sales | 2 | $123,108 | $2,244 | 1.82% | $125,352 |
$\;\;\;\;\;\;$This function shows the last 10 rows of the main data frame, including all of the values from each column in the data frame.
# Performs a basic statistical analysis on the pay_num data frame
pay_num.describe().style.format('{:,.2f}')
| | Job Title | Gender | Age | Performance Evaluation | Education | Department | Seniority | Base Pay | Bonus | Percent Bonus | Total Pay |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1,000.00 | 1,000.00 | 1,000.00 | 1,000.00 | 1,000.00 | 1,000.00 | 1,000.00 | 1,000.00 | 1,000.00 | 1,000.00 | 1,000.00 |
| mean | 4.54 | 0.47 | 41.39 | 3.04 | 1.47 | 1.95 | 2.97 | 94,472.65 | 6,467.16 | 7.52 | 100,939.81 |
| std | 2.87 | 0.50 | 14.29 | 1.42 | 1.12 | 1.42 | 1.40 | 25,337.49 | 2,004.38 | 3.55 | 25,156.60 |
| min | 0.00 | 0.00 | 18.00 | 1.00 | 0.00 | 0.00 | 1.00 | 34,208.00 | 1,703.00 | 1.55 | 40,828.00 |
| 25% | 2.00 | 0.00 | 29.00 | 2.00 | 0.00 | 1.00 | 2.00 | 76,850.25 | 4,849.50 | 4.90 | 83,443.00 |
| 50% | 5.00 | 0.00 | 41.00 | 3.00 | 1.00 | 2.00 | 3.00 | 93,327.50 | 6,507.00 | 6.95 | 100,047.00 |
| 75% | 7.00 | 1.00 | 54.25 | 4.00 | 2.00 | 3.00 | 4.00 | 111,558.00 | 8,026.00 | 9.43 | 117,656.00 |
| max | 9.00 | 1.00 | 65.00 | 5.00 | 3.00 | 4.00 | 5.00 | 179,726.00 | 11,293.00 | 23.46 | 184,010.00 |
$\;\;\;\;\;\;$This function performs a basic statistical analysis on every column of the pay_num data frame: the count, mean, standard deviation, minimum, 25th percentile, 50th percentile, 75th percentile, and maximum. From this we can see many things; however, the most important values are the means of gender, age, performance evaluation, education, seniority, base pay, bonus, percent bonus, and total pay. The mean age is 41.39, the mean performance evaluation is 3.04 (out of 5), the mean seniority is 2.97 (out of 5), the mean base pay is \$94,472.65, the mean bonus is \$6,467.16, the mean percent bonus is 7.52%, and the mean total pay is \$100,939.81. The mean gender is 0.47, which means that there is a fairly even number of male and female employees, albeit slightly leaning towards the male side. The mean education is 1.47, which falls between the codes for a college education (1) and a masters education (2); together with the 25th percentile of 0 and the median of 1, this shows that most employees have completed at least a college education.
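$\;\;\;\;\;\;$The interpretation of the mean gender and mean education can be checked with a short sketch. It rebuilds the coded columns from the category counts reported in this analysis (532 males and 468 females; 265 high school, 241 college, 256 masters, and 238 PhD) rather than reading the CSV file, and it uses the same codes as pay_num.

```python
# Gender coded as in pay_num: Male -> 0, Female -> 1
gender_codes = [0] * 532 + [1] * 468
mean_gender = sum(gender_codes) / len(gender_codes)

# Education coded as in pay_num: High School -> 0, College -> 1, Masters -> 2, PhD -> 3
education_codes = [0] * 265 + [1] * 241 + [2] * 256 + [3] * 238
mean_education = sum(education_codes) / len(education_codes)

# Share of employees with at least a college education (code 1 or higher)
college_or_higher = sum(code >= 1 for code in education_codes) / len(education_codes)

print(f'mean gender: {mean_gender:.3f}')
print(f'mean education: {mean_education:.3f}')
print(f'college or higher: {college_or_higher:.1%}')
```

The mean of a column coded as 0 and 1 is simply the proportion of rows coded as 1, which is why the mean gender can be read as the share of female employees.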
# Shows the number of values in each category of the job title column of the main data frame
pay['Job Title'].value_counts()
Marketing Associate    118
Software Engineer      109
Financial Analyst      107
Data Scientist         107
Graphic Designer        98
IT                      96
Sales Associate         94
Driver                  91
Warehouse Associate     90
Manager                 90
Name: Job Title, dtype: int64
$\;\;\;\;\;\;$This function shows the number of values in each category of the job title column of the main data frame. This shows that there are 118 marketing associates, 109 software engineers, 107 financial analysts, 107 data scientists, 98 graphic designers, 96 IT employees, 94 sales associates, 91 drivers, 90 warehouse associates, and 90 managers. This data shows a fairly even distribution of employees throughout the ten job titles.
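$\;\;\;\;\;\;$To put a number on "fairly even", the counts above can be converted to shares of the one thousand employees. The sketch below starts from the reported counts instead of the raw column; on the raw column, `value_counts(normalize=True)` would be expected to give the same shares directly.

```python
import pandas as pd

# Job title counts as reported by value_counts() above
counts = pd.Series({'Marketing Associate': 118, 'Software Engineer': 109,
                    'Financial Analyst': 107, 'Data Scientist': 107,
                    'Graphic Designer': 98, 'IT': 96, 'Sales Associate': 94,
                    'Driver': 91, 'Warehouse Associate': 90, 'Manager': 90})

# Share of all employees in each job title
shares = counts / counts.sum()

print(shares.map('{:.1%}'.format))
print('smallest share:', shares.min(), 'largest share:', shares.max())
```

With ten job titles, a perfectly even split would give each title a share of 10%, so the shares can be compared against that baseline.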
# Shows the number of values in each category of the gender column of the main data frame
pay['Gender'].value_counts()
Male      532
Female    468
Name: Gender, dtype: int64
$\;\;\;\;\;\;$This function shows the number of values in each category of the gender column of the main data frame. This shows that there are 532 males and 468 females. This data shows a fairly even distribution of employees between the two genders.
# Shows the number of values in each category of the performance evaluation column of the main data frame
pay['Performance Evaluation'].value_counts()
5    209
4    207
1    198
3    194
2    192
Name: Performance Evaluation, dtype: int64
$\;\;\;\;\;\;$This function shows the number of values in each category of the performance evaluation column of the main data frame. This shows that there are 209 employees with a 5 performance evaluation score, 207 employees with a 4 performance evaluation score, 198 employees with a 1 performance evaluation score, 194 employees with a 3 performance evaluation score, and 192 employees with a 2 performance evaluation score. This data shows a fairly even distribution of employees throughout the five performance evaluation scores.
# Shows the number of values in each category of the education column of the main data frame
pay['Education'].value_counts()
High School    265
Masters        256
College        241
PhD            238
Name: Education, dtype: int64
$\;\;\;\;\;\;$This function shows the number of values in each category of the education column of the main data frame. This shows that there are 265 employees with a high school level education, 256 employees with a masters level education, 241 employees with a college level education, and 238 employees with a PhD level education. This data shows a fairly even distribution of employees throughout the four levels of education.
# Shows the number of values in each category of the department column of the main data frame
pay['Department'].value_counts()
Operations        210
Sales             207
Management        198
Administration    193
Engineering       192
Name: Department, dtype: int64
$\;\;\;\;\;\;$This function shows the number of values in each category of the department column of the main data frame. This shows that there are 210 employees in operations, 207 employees in sales, 198 employees in management, 193 employees in administration, and 192 employees in engineering. This data shows a fairly even distribution of employees throughout the five departments.
# Shows the number of values in each category of the seniority column of the main data frame
pay['Seniority'].value_counts()
3    219
2    209
1    195
5    193
4    184
Name: Seniority, dtype: int64
$\;\;\;\;\;\;$This function shows the number of values in each category of the seniority column of the main data frame. This shows that there are 219 employees with a seniority level of 3, 209 employees with a seniority level of 2, 195 employees with a seniority level of 1, 193 employees with a seniority level of 5, and 184 employees with a seniority level of 4. This data shows a fairly even distribution of employees throughout the five seniority levels.
# Groups the data by gender and seniority and then performs the mean of each piece of numerical data in these subgroups
pay[['Education', 'Gender', 'Age', 'Performance Evaluation', 'Seniority', 'Base Pay',
'Bonus', 'Total Pay', 'Percent Bonus']] \
.groupby(['Gender', 'Seniority']).mean(numeric_only=True).sort_values(by='Gender', ascending=True) \
.style.format({'Base Pay': '${:,.0f}', 'Bonus': '${:,.0f}', 'Age': '{:,.2f}',
'Performance Evaluation': '{:,.2f}', 'Total Pay': '${:,.0f}', 'Percent Bonus': '{:,.2f}%'})
| Gender | Seniority | Age | Performance Evaluation | Base Pay | Bonus | Total Pay | Percent Bonus |
|---|---|---|---|---|---|---|---|
| Female | 1 | 41.83 | 3.12 | $69,470 | $6,042 | $75,512 | 9.65% |
| Female | 2 | 42.46 | 2.90 | $79,537 | $6,138 | $85,676 | 8.58% |
| Female | 3 | 43.45 | 2.95 | $91,282 | $6,453 | $97,735 | 7.62% |
| Female | 4 | 39.71 | 2.80 | $99,326 | $6,720 | $106,046 | 7.25% |
| Female | 5 | 41.13 | 2.91 | $109,200 | $7,017 | $116,217 | 6.70% |
| Male | 1 | 41.27 | 3.21 | $82,302 | $6,021 | $88,323 | 8.12% |
| Male | 2 | 40.73 | 3.04 | $89,505 | $6,050 | $95,556 | 7.49% |
| Male | 3 | 41.42 | 3.01 | $97,867 | $6,338 | $104,205 | 6.91% |
| Male | 4 | 40.59 | 3.22 | $107,854 | $6,878 | $114,732 | 6.78% |
| Male | 5 | 41.00 | 3.16 | $117,800 | $7,126 | $124,926 | 6.31% |
$\;\;\;\;\;\;$This function groups the data by the gender and seniority columns of the main data frame and then takes the mean of each numerical column within each of the resulting subgroups. From this, it can be seen that the mean age at each seniority level, in both the male and female categories, hovers around forty years old. Furthermore, the mean performance evaluation scores at each seniority level, for each gender, are around three. In addition, it can be seen that the mean base pay and total pay are substantially higher for males at every seniority level. However, the bonus and percent bonus columns do not reflect this. The mean bonus is about the same for each gender, slightly higher for females at seniority levels 1 through 3 and slightly higher for males at levels 4 and 5, while the mean percent bonus is higher for females at every seniority level.
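$\;\;\;\;\;\;$The size of the base pay difference at each seniority level can be worked out from the group means in the table. The sketch below copies those means by hand and expresses the gap in dollars and as a percentage of the mean male base pay. It is an unadjusted comparison that does not control for job title, education, or any other column.

```python
import pandas as pd

# Mean base pay by seniority level, copied from the grouped table
base_pay = pd.DataFrame(
    {'Female': [69470, 79537, 91282, 99326, 109200],
     'Male': [82302, 89505, 97867, 107854, 117800]},
    index=pd.Index([1, 2, 3, 4, 5], name='Seniority'))

# Gap in dollars and as a percentage of the mean male base pay
base_pay['Gap ($)'] = base_pay['Male'] - base_pay['Female']
base_pay['Gap (%)'] = 100 * base_pay['Gap ($)'] / base_pay['Male']

print(base_pay.round(2))
```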
# Creates a heatmap of the cross tabulation of the gender and seniority columns in the main data frame
plt.figure(figsize=(12,10))
sns.heatmap(pd.crosstab(pay['Gender'], pay['Seniority'], margins=True, margins_name='Total', normalize=True),
cmap=cm.PiYG, annot=True, center=0)
plt.title('Gender and Seniority Cross Tabulation', fontsize=16)
plt.xlabel('Seniority', fontsize=14)
plt.ylabel('Gender', fontsize=14)
$\;\;\;\;\;\;$This function performs a cross-tabulation of the gender and seniority columns of the main data frame. This shows that the number of employees in each gender and seniority level subgroup is about the same, as all of the values are around 0.1, or 10%.
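$\;\;\;\;\;\;$How `normalize=True` and `margins=True` work together can be seen on a small example. The eight-row frame below is made up for illustration and is not part of the data set; it has two genders and only two seniority levels, so an even split gives 0.25 per cell instead of the roughly 0.1 seen in the heatmap, where ten gender and seniority combinations share the total.

```python
import pandas as pd

# Made-up toy data: 2 genders x 2 seniority levels, 2 employees in each combination
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Seniority': [1, 2, 1, 2, 1, 1, 2, 2]})

# normalize=True divides every count by the grand total;
# margins=True adds a 'Total' row and column holding the row and column shares
table = pd.crosstab(toy['Gender'], toy['Seniority'],
                    margins=True, margins_name='Total', normalize=True)

print(table)
```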
# Creates a heatmap of the correlation in the pay_num data frame
plt.figure(figsize=(12,10))
sns.heatmap(pay_num.corr(), cmap=cm.PiYG, annot=True, center=0)
plt.title('Correlation', fontsize=16)
plt.xlabel('Column', fontsize=14)
plt.ylabel('Column', fontsize=14)
$\;\;\;\;\;\;$This function creates a correlation graph of the numerical data from the pay_num data frame. The values can range from negative one to one, and the closer the absolute value of a box in this graph is to one, the stronger the correlation is between those two columns. Based on this, it can be seen that the following pairs of columns have a strong correlation: age with base pay, bonus, percent bonus, and total pay; performance evaluation with bonus and percent bonus; seniority with base pay and total pay; base pay with percent bonus and total pay; bonus with percent bonus; and percent bonus with total pay.
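$\;\;\;\;\;\;$Each box in the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. The sketch below computes the coefficient from its definition and compares it with the pandas result. The data is randomly generated for illustration (the relationship between age and base pay here is an assumption, not a fit to the Glassdoor data).

```python
import numpy as np
import pandas as pd

# Randomly generated toy data with a built-in positive relationship
rng = np.random.default_rng(0)
age = rng.integers(18, 66, size=200)
base_pay = 50000 + 1000 * age + rng.normal(0, 10000, size=200)
toy = pd.DataFrame({'Age': age, 'Base Pay': base_pay})

# Pearson correlation as computed by pandas
r_pandas = toy.corr().loc['Age', 'Base Pay']

# The same coefficient from its definition:
# sum of products of deviations, divided by the product of their magnitudes
x = toy['Age'] - toy['Age'].mean()
y = toy['Base Pay'] - toy['Base Pay'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_pandas, r_manual)
```

A coefficient near one or negative one indicates a strong linear relationship, while a coefficient near zero indicates little or no linear relationship. For columns such as job title and department, whose numeric codes in pay_num are arbitrary labels, the coefficient depends on the order in which the codes were assigned.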
# Creates a bar plot with 4 subplots showing the difference in pay statistics between gender
fig, axes = plt.subplots(1, 4, figsize=(30, 10), sharey=True)
fig.suptitle('Pay Statistics by Gender', fontsize=20)
# Creates base pay and gender subplot
sns.barplot(ax=axes[0], x='Base Pay', y='Gender', data=pay, palette='PiYG')
axes[0].set_title('Gender versus Base Pay', fontsize=16)
axes[0].set_xlabel('Base Pay ($)', fontsize=14)
axes[0].set_ylabel('Gender', fontsize=14)
show_values(axes[0], 'h', '${:,.2f}')
# Creates bonus and gender subplot
sns.barplot(ax=axes[1], x='Bonus', y='Gender', data=pay, palette='PiYG')
axes[1].set_title('Gender versus Bonus', fontsize=16)
axes[1].set_xlabel('Bonus ($)', fontsize=14)
axes[1].set_ylabel('', fontsize=14)
show_values(axes[1], 'h', '${:,.2f}')
# Creates total pay and gender subplot
sns.barplot(ax=axes[2], x='Total Pay', y='Gender', data=pay, palette='PiYG')
axes[2].set_title('Gender versus Total Pay', fontsize=16)
axes[2].set_xlabel('Total Pay ($)', fontsize=14)
axes[2].set_ylabel('', fontsize=14)
show_values(axes[2], 'h', '${:,.2f}')
# Creates percent bonus and gender subplot
sns.barplot(ax=axes[3], x='Percent Bonus', y='Gender', data=pay, palette='PiYG')
axes[3].set_title('Gender versus Percent Bonus', fontsize=16)
axes[3].set_xlabel('Percent Bonus (%)', fontsize=14)
axes[3].set_ylabel('', fontsize=14)
show_values(axes[3], 'h', '{:,.2f}%')
# Creates a strip and box plot with 4 subplots showing the difference in pay statistics between gender
fig, axes = plt.subplots(1, 4, figsize=(30, 10), sharey=True)
fig.suptitle('Pay Statistics by Gender', fontsize=20)
# Creates base pay and gender subplot
sns.stripplot(ax=axes[0], x='Base Pay', y='Gender', data=pay, color='k', alpha=0.7)
sns.boxplot(ax=axes[0], x='Base Pay', y='Gender', data=pay, palette='PiYG', showfliers=False)
axes[0].set_title('Gender versus Base Pay', fontsize=16)
axes[0].set_xlabel('Base Pay ($)', fontsize=14)
axes[0].set_ylabel('Gender', fontsize=14)
# Creates bonus and gender subplot
sns.stripplot(ax=axes[1], x='Bonus', y='Gender', data=pay, color='k', alpha=0.7)
sns.boxplot(ax=axes[1], x='Bonus', y='Gender', data=pay, palette='PiYG', showfliers=False)
axes[1].set_title('Gender versus Bonus', fontsize=16)
axes[1].set_xlabel('Bonus ($)', fontsize=14)
axes[1].set_ylabel('', fontsize=14)
# Creates total pay and gender subplot
sns.stripplot(ax=axes[2], x='Total Pay', y='Gender', data=pay, color='k', alpha=0.7)
sns.boxplot(ax=axes[2], x='Total Pay', y='Gender', data=pay, palette='PiYG', showfliers=False)
axes[2].set_title('Gender versus Total Pay', fontsize=16)
axes[2].set_xlabel('Total Pay ($)', fontsize=14)
axes[2].set_ylabel('', fontsize=14)
# Creates percent bonus and gender subplot
sns.stripplot(ax=axes[3], x='Percent Bonus', y='Gender', data=pay, color='k', alpha=0.7)
sns.boxplot(ax=axes[3], x='Percent Bonus', y='Gender', data=pay, palette='PiYG', showfliers=False)
axes[3].set_title('Gender versus Percent Bonus', fontsize=16)
axes[3].set_xlabel('Percent Bonus (%)', fontsize=14)
axes[3].set_ylabel('', fontsize=14)
# Creates a swarm and violin plot with 4 subplots showing the difference in pay statistics between gender
fig, axes = plt.subplots(1, 4, figsize=(30, 10), sharey=True)
fig.suptitle('Pay Statistics by Gender', fontsize=20)
# Creates base pay and gender subplot
sns.violinplot(ax=axes[0], x='Base Pay', y='Gender', data=pay, palette='PiYG', inner=None)
sns.swarmplot(ax=axes[0], x='Base Pay', y='Gender', data=pay, color='k', alpha=0.7, s=2.25)
axes[0].set_title('Gender versus Base Pay', fontsize=16)
axes[0].set_xlabel('Base Pay ($)', fontsize=14)
axes[0].set_ylabel('Gender', fontsize=14)
# Creates bonus and gender subplot
sns.violinplot(ax=axes[1], x='Bonus', y='Gender', data=pay, palette='PiYG', inner=None)
sns.swarmplot(ax=axes[1], x='Bonus', y='Gender', data=pay, color='k', alpha=0.7, s=2.25)
axes[1].set_title('Gender versus Bonus', fontsize=16)
axes[1].set_xlabel('Bonus ($)', fontsize=14)
axes[1].set_ylabel('', fontsize=14)
# Creates total pay and gender subplot
sns.violinplot(ax=axes[2], x='Total Pay', y='Gender', data=pay, palette='PiYG', inner=None)
sns.swarmplot(ax=axes[2], x='Total Pay', y='Gender', data=pay, color='k', alpha=0.7, s=2.25)
axes[2].set_title('Gender versus Total Pay', fontsize=16)
axes[2].set_xlabel('Total Pay ($)', fontsize=14)
axes[2].set_ylabel('', fontsize=14)
# Creates percent bonus and gender subplot
sns.violinplot(ax=axes[3], x='Percent Bonus', y='Gender', data=pay, palette='PiYG', inner=None)
sns.swarmplot(ax=axes[3], x='Percent Bonus', y='Gender', data=pay, color='k', alpha=0.7, s=2.25)
axes[3].set_title('Gender versus Percent Bonus', fontsize=16)
axes[3].set_xlabel('Percent Bonus (%)', fontsize=14)
axes[3].set_ylabel('', fontsize=14)
$\;\;\;\;\;\;$Many different types of graphs were created to analyze the difference in a variety of pay statistics by gender. The pay statistics analyzed include base pay, bonus, total pay, and percent bonus. The types of graphs created include bar graphs, box plots, strip plots, violin plots, and swarm plots. From these graphs, it can be seen that the pay statistics for both males and females have a wide range of values. Despite this, there are not many outliers in any of the categories other than percent bonus, which has many more outliers than the other three columns being analyzed. In addition, it can be seen that the base pay and total pay of male employees are higher than those of female employees. While this result was expected, it was still good to confirm it. More surprisingly, the bonus that employees received did not differ much between genders, with females receiving slightly higher bonuses. Since the bonuses of females are generally slightly higher than those of males while their base pay is generally quite a bit lower, females generally have a substantially higher percent bonus than males. This is surprising, as it would have been expected that both the bonus values and percent bonus values for females would be lower than those of males. In conclusion, while males have a substantially higher base and total pay, females surprisingly have higher bonuses and percent bonuses than males.
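$\;\;\;\;\;\;$The reasoning that a similar bonus on a lower base pay produces a higher percent bonus can be shown with a little arithmetic. The sketch below uses the seniority level 1 group means from the grouped table earlier in this analysis. Its results are lower than the mean percent bonuses reported in that table (9.65% and 8.12%) because the Percent Bonus column is calculated for each employee and then averaged, and the mean of ratios is generally not equal to the ratio of means. The two-employee example at the end is made up to show that effect.

```python
# Seniority level 1 group means from the grouped table
female_bonus, female_base = 6042, 69470
male_bonus, male_base = 6021, 82302

# Ratio of means: mean bonus as a percentage of mean base pay
female_ratio = 100 * female_bonus / female_base
male_ratio = 100 * male_bonus / male_base
print(f'female: {female_ratio:.2f}%  male: {male_ratio:.2f}%')

# Made-up example: the same bonus on two different base pays
bonuses = [5000, 5000]
base_pays = [50000, 100000]

# Mean of the per-employee ratios (how the Percent Bonus column is averaged)
mean_of_ratios = sum(100 * b / p for b, p in zip(bonuses, base_pays)) / len(bonuses)

# Ratio of the means (what the group means alone can give)
ratio_of_means = 100 * (sum(bonuses) / len(bonuses)) / (sum(base_pays) / len(base_pays))
print(f'mean of ratios: {mean_of_ratios:.2f}%  ratio of means: {ratio_of_means:.2f}%')
```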
# Creates a bar plot with 8 subplots showing the difference in pay statistics between gender by job title and department
fig, axes = plt.subplots(2, 4, figsize=(30, 20), sharey='row')
fig.suptitle('Pay Statistics by Job Title, Department, and Gender', fontsize=20)
# Creates base pay, job title, and gender subplot
sns.barplot(ax=axes[0, 0], x='Base Pay', y='Job Title', hue='Gender', data=pay, palette='PiYG')
axes[0, 0].set_title('Gender and Job Title versus Base Pay', fontsize=16)
axes[0, 0].set_xlabel('Base Pay ($)', fontsize=14)
axes[0, 0].set_ylabel('Job Title', fontsize=14)
show_values(axes[0, 0], 'h', '${:,.2f}')
sns.move_legend(axes[0, 0], 'upper right', bbox_to_anchor=(4.5, 1.25), frameon=False, fontsize=16, title='')
# Creates bonus, job title, and gender subplot
sns.barplot(ax=axes[0, 1], x='Bonus', y='Job Title', hue='Gender', data=pay, palette='PiYG')
axes[0, 1].set_title('Gender and Job Title versus Bonus', fontsize=16)
axes[0, 1].set_xlabel('Bonus ($)', fontsize=14)
axes[0, 1].set_ylabel('', fontsize=14)
show_values(axes[0, 1], 'h', '${:,.2f}')
axes[0, 1].get_legend().remove()
# Creates total pay, job title, and gender subplot
sns.barplot(ax=axes[0, 2], x='Total Pay', y='Job Title', hue='Gender', data=pay, palette='PiYG')
axes[0, 2].set_title('Gender and Job Title versus Total Pay', fontsize=16)
axes[0, 2].set_xlabel('Total Pay ($)', fontsize=14)
axes[0, 2].set_ylabel('', fontsize=14)
show_values(axes[0, 2], 'h', '${:,.2f}')
axes[0, 2].get_legend().remove()
# Creates percent bonus, job title, and gender subplot
sns.barplot(ax=axes[0, 3], x='Percent Bonus', y='Job Title', hue='Gender', data=pay, palette='PiYG')
axes[0, 3].set_title('Gender and Job Title versus Percent Bonus', fontsize=16)
axes[0, 3].set_xlabel('Percent Bonus (%)', fontsize=14)
axes[0, 3].set_ylabel('', fontsize=14)
show_values(axes[0, 3], 'h', '{:,.2f}%')
axes[0, 3].get_legend().remove()
# Creates base pay, department, and gender subplot
sns.barplot(ax=axes[1, 0], x='Base Pay', y='Department', hue='Gender', data=pay, palette='PiYG')
axes[1, 0].set_title('Gender and Department versus Base Pay', fontsize=16)
axes[1, 0].set_xlabel('Base Pay ($)', fontsize=14)
axes[1, 0].set_ylabel('Department', fontsize=14)
show_values(axes[1, 0], 'h', '${:,.2f}')
axes[1, 0].get_legend().remove()
# Creates bonus, department, and gender subplot
sns.barplot(ax=axes[1, 1], x='Bonus', y='Department', hue='Gender', data=pay, palette='PiYG')
axes[1, 1].set_title('Gender and Department versus Bonus', fontsize=16)
axes[1, 1].set_xlabel('Bonus ($)', fontsize=14)
axes[1, 1].set_ylabel('', fontsize=14)
show_values(axes[1, 1], 'h', '${:,.2f}')
axes[1, 1].get_legend().remove()
# Creates total pay, department, and gender subplot
sns.barplot(ax=axes[1, 2], x='Total Pay', y='Department', hue='Gender', data=pay, palette='PiYG')
axes[1, 2].set_title('Gender and Department versus Total Pay', fontsize=16)
axes[1, 2].set_xlabel('Total Pay ($)', fontsize=14)
axes[1, 2].set_ylabel('', fontsize=14)
show_values(axes[1, 2], 'h', '${:,.2f}')
axes[1, 2].get_legend().remove()
# Creates percent bonus, department, and gender subplot
sns.barplot(ax=axes[1, 3], x='Percent Bonus', y='Department', hue='Gender', data=pay, palette='PiYG')
axes[1, 3].set_title('Gender and Department versus Percent Bonus', fontsize=16)
axes[1, 3].set_xlabel('Percent Bonus (%)', fontsize=14)
axes[1, 3].set_ylabel('', fontsize=14)
show_values(axes[1, 3], 'h', '{:,.2f}%')
axes[1, 3].get_legend().remove()
# Creates a strip and box plot with 8 subplots showing the difference in pay statistics
# between gender by job title and department
fig, axes = plt.subplots(2, 4, figsize=(30, 20), sharey='row')
fig.suptitle('Pay Statistics by Job Title, Department, and Gender', fontsize=20)
# Creates base pay, job title, and gender subplot
sns.stripplot(ax=axes[0, 0], x='Base Pay', y='Job Title', hue='Gender', data=pay, color='k', dodge=True, alpha=0.7)
sns.boxplot(ax=axes[0, 0], x='Base Pay', y='Job Title', hue='Gender', data=pay, palette='PiYG', showfliers=False)
axes[0, 0].set_title('Gender and Job Title versus Base Pay', fontsize=16)
axes[0, 0].set_xlabel('Base Pay ($)', fontsize=14)
axes[0, 0].set_ylabel('Job Title', fontsize=14)
sns.move_legend(axes[0, 0], 'upper right', bbox_to_anchor=(4.5, 1.25), frameon=False, fontsize=16, title='')
# Creates bonus, job title, and gender subplot
sns.stripplot(ax=axes[0, 1], x='Bonus', y='Job Title', hue='Gender', data=pay, color='k', dodge=True, alpha=0.7)
sns.boxplot(ax=axes[0, 1], x='Bonus', y='Job Title', hue='Gender', data=pay, palette='PiYG', showfliers=False)
axes[0, 1].set_title('Gender and Job Title versus Bonus', fontsize=16)
axes[0, 1].set_xlabel('Bonus ($)', fontsize=14)
axes[0, 1].set_ylabel('', fontsize=14)
axes[0, 1].get_legend().remove()
# Creates total pay, job title, and gender subplot
sns.stripplot(ax=axes[0, 2], x='Total Pay', y='Job Title', hue='Gender', data=pay, color='k', dodge=True, alpha=0.7)
sns.boxplot(ax=axes[0, 2], x='Total Pay', y='Job Title', hue='Gender', data=pay, palette='PiYG', showfliers=False)
axes[0, 2].set_title('Gender and Job Title versus Total Pay', fontsize=16)
axes[0, 2].set_xlabel('Total Pay ($)', fontsize=14)
axes[0, 2].set_ylabel('', fontsize=14)
axes[0, 2].get_legend().remove()
# Creates percent bonus, job title, and gender subplot
sns.stripplot(ax=axes[0, 3], x='Percent Bonus', y='Job Title', hue='Gender', data=pay, color='k', dodge=True, alpha=0.7)
sns.boxplot(ax=axes[0, 3], x='Percent Bonus', y='Job Title', hue='Gender', data=pay, palette='PiYG', showfliers=False)
axes[0, 3].set_title('Gender and Job Title versus Percent Bonus', fontsize=16)
axes[0, 3].set_xlabel('Percent Bonus (%)', fontsize=14)
axes[0, 3].set_ylabel('', fontsize=14)
axes[0, 3].get_legend().remove()
# Creates base pay, department, and gender subplot
sns.stripplot(ax=axes[1, 0], x='Base Pay', y='Department', hue='Gender', data=pay, color='k', dodge=True, alpha=0.7)
sns.boxplot(ax=axes[1, 0], x='Base Pay', y='Department', hue='Gender', data=pay, palette='PiYG', showfliers=False)
axes[1, 0].set_title('Gender and Department versus Base Pay', fontsize=16)
axes[1, 0].set_xlabel('Base Pay ($)', fontsize=14)
axes[1, 0].set_ylabel('Department', fontsize=14)
axes[1, 0].get_legend().remove()
# Creates bonus, department, and gender subplot
sns.stripplot(ax=axes[1, 1], x='Bonus', y='Department', hue='Gender', data=pay, color='k', dodge=True, alpha=0.7)
sns.boxplot(ax=axes[1, 1], x='Bonus', y='Department', hue='Gender', data=pay, palette='PiYG', showfliers=False)
axes[1, 1].set_title('Gender and Department versus Bonus', fontsize=16)
axes[1, 1].set_xlabel('Bonus ($)', fontsize=14)
axes[1, 1].set_ylabel('', fontsize=14)
axes[1, 1].get_legend().remove()
# Creates total pay, department, and gender subplot
sns.stripplot(ax=axes[1, 2], x='Total Pay', y='Department', hue='Gender', data=pay, color='k', dodge=True, alpha=0.7)
sns.boxplot(ax=axes[1, 2], x='Total Pay', y='Department', hue='Gender', data=pay, palette='PiYG', showfliers=False)
axes[1, 2].set_title('Gender and Department versus Total Pay', fontsize=16)
axes[1, 2].set_xlabel('Total Pay ($)', fontsize=14)
axes[1, 2].set_ylabel('', fontsize=14)
axes[1, 2].get_legend().remove()
# Creates percent bonus, department, and gender subplot
sns.stripplot(ax=axes[1, 3], x='Percent Bonus', y='Department',
hue='Gender', data=pay, color='k', dodge=True, alpha=0.7)
sns.boxplot(ax=axes[1, 3], x='Percent Bonus', y='Department', hue='Gender', data=pay, palette='PiYG', showfliers=False)
axes[1, 3].set_title('Gender and Department versus Percent Bonus', fontsize=16)
axes[1, 3].set_xlabel('Percent Bonus (%)', fontsize=14)
axes[1, 3].set_ylabel('', fontsize=14)
axes[1, 3].get_legend().remove()
# Creates a swarm and violin plot with 8 subplots showing the difference in pay statistics
# between gender by job title and department
fig, axes = plt.subplots(2, 4, figsize=(30, 20), sharey='row')
fig.suptitle('Pay Statistics by Job Title, Department, and Gender', fontsize=20)
# Creates base pay, job title, and gender subplot
sns.swarmplot(ax=axes[0, 0], x='Base Pay', y='Job Title', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7, s=2.25)
sns.violinplot(ax=axes[0, 0], x='Base Pay', y='Job Title', hue='Gender', data=pay, palette='PiYG')
axes[0, 0].set_title('Gender and Job Title versus Base Pay', fontsize=16)
axes[0, 0].set_xlabel('Base Pay ($)', fontsize=14)
axes[0, 0].set_ylabel('Job Title', fontsize=14)
sns.move_legend(axes[0, 0], 'upper right', bbox_to_anchor=(4.5, 1.25), frameon=False, fontsize=16, title='')
# Creates bonus, job title, and gender subplot
sns.swarmplot(ax=axes[0, 1], x='Bonus', y='Job Title', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7, s=2.25)
sns.violinplot(ax=axes[0, 1], x='Bonus', y='Job Title', hue='Gender', data=pay, palette='PiYG')
axes[0, 1].set_title('Gender and Job Title versus Bonus', fontsize=16)
axes[0, 1].set_xlabel('Bonus ($)', fontsize=14)
axes[0, 1].set_ylabel('', fontsize=14)
axes[0, 1].get_legend().remove()
# Creates total pay, job title, and gender subplot
sns.swarmplot(ax=axes[0, 2], x='Total Pay', y='Job Title', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7, s=2.25)
sns.violinplot(ax=axes[0, 2], x='Total Pay', y='Job Title', hue='Gender', data=pay, palette='PiYG')
axes[0, 2].set_title('Gender and Job Title versus Total Pay', fontsize=16)
axes[0, 2].set_xlabel('Total Pay ($)', fontsize=14)
axes[0, 2].set_ylabel('', fontsize=14)
axes[0, 2].get_legend().remove()
# Creates percent bonus, job title, and gender subplot
sns.swarmplot(ax=axes[0, 3], x='Percent Bonus', y='Job Title', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7, s=2.25)
sns.violinplot(ax=axes[0, 3], x='Percent Bonus', y='Job Title', hue='Gender', data=pay, palette='PiYG')
axes[0, 3].set_title('Gender and Job Title versus Percent Bonus', fontsize=16)
axes[0, 3].set_xlabel('Percent Bonus (%)', fontsize=14)
axes[0, 3].set_ylabel('', fontsize=14)
axes[0, 3].get_legend().remove()
# Creates base pay, department, and gender subplot
sns.swarmplot(ax=axes[1, 0], x='Base Pay', y='Department', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7, s=2.25)
sns.violinplot(ax=axes[1, 0], x='Base Pay', y='Department', hue='Gender', data=pay, palette='PiYG')
axes[1, 0].set_title('Gender and Department versus Base Pay', fontsize=16)
axes[1, 0].set_xlabel('Base Pay ($)', fontsize=14)
axes[1, 0].set_ylabel('Department', fontsize=14)
axes[1, 0].get_legend().remove()
# Creates bonus, department, and gender subplot
sns.swarmplot(ax=axes[1, 1], x='Bonus', y='Department', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7, s=2.25)
sns.violinplot(ax=axes[1, 1], x='Bonus', y='Department', hue='Gender', data=pay, palette='PiYG')
axes[1, 1].set_title('Gender and Department versus Bonus', fontsize=16)
axes[1, 1].set_xlabel('Bonus ($)', fontsize=14)
axes[1, 1].set_ylabel('', fontsize=14)
axes[1, 1].get_legend().remove()
# Creates total pay, department, and gender subplot
sns.swarmplot(ax=axes[1, 2], x='Total Pay', y='Department', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7, s=2.25)
sns.violinplot(ax=axes[1, 2], x='Total Pay', y='Department', hue='Gender', data=pay, palette='PiYG')
axes[1, 2].set_title('Gender and Department versus Total Pay', fontsize=16)
axes[1, 2].set_xlabel('Total Pay ($)', fontsize=14)
axes[1, 2].set_ylabel('', fontsize=14)
axes[1, 2].get_legend().remove()
# Creates percent bonus, department, and gender subplot
sns.swarmplot(ax=axes[1, 3], x='Percent Bonus', y='Department', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7, s=2.25)
sns.violinplot(ax=axes[1, 3], x='Percent Bonus', y='Department', hue='Gender', data=pay, palette='PiYG')
axes[1, 3].set_title('Gender and Department versus Percent Bonus', fontsize=16)
axes[1, 3].set_xlabel('Percent Bonus (%)', fontsize=14)
axes[1, 3].set_ylabel('', fontsize=14)
axes[1, 3].get_legend().remove()
$\;\;\;\;\;\;$Several types of graphs were created to analyze how a variety of pay statistics differ by department, job title, and gender. The pay statistics analyzed are base pay, bonus, total pay, and percent bonus. The graph types are bar graphs, box plots, strip plots, violin plots, and swarm plots. From these graphs, it can be seen that the gender pay gap varies only slightly between departments but varies immensely between jobs. Within each department, the general trends match those of the data as a whole. The only exceptions are the management, administration, and engineering departments, in which men have a higher mean bonus than women. In contrast, each job seems to follow its own rules. For instance, the base pay and total pay values for graphic designers, warehouse associates, financial analysts, data scientists, and managers are higher for females than for males. This is extremely surprising, as it is the exact opposite of the trend followed by the rest of the data. In addition, male graphic designers, software engineers, drivers, financial analysts, marketing associates, and managers have on average a higher bonus, which contrasts with the overall trend in which females have a slightly higher bonus. That overall trend may be the result of one job, warehouse associate, in which females have a substantially higher mean bonus than males; this one job may skew the overall data in that direction. Lastly, male graphic designers, financial analysts, and managers also have a higher percent bonus than females, which likewise does not follow the trend of the data as a whole.
In conclusion, the gender pay gap, surprisingly, varies only slightly between departments but varies greatly between jobs.
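The comparison read off the bar graphs can also be expressed as a small table of group means. The following is a minimal sketch of that idea, not the method used in this analysis: the data frame `pay_demo`, its rows, and the column name `Gap (Male - Female)` are made up for illustration and are not values from the Glassdoor data set.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT values from the Glassdoor data.
pay_demo = pd.DataFrame({
    'Job Title': ['Driver', 'Driver', 'Driver', 'Driver',
                  'Manager', 'Manager', 'Manager', 'Manager'],
    'Gender': ['Female', 'Male', 'Female', 'Male',
               'Female', 'Male', 'Female', 'Male'],
    'Base Pay': [88000, 92000, 90000, 94000,
                 127000, 124000, 129000, 126000],
})

# Mean base pay for every job title, with one column per gender
mean_pay = pay_demo.pivot_table(index='Job Title', columns='Gender',
                                values='Base Pay', aggfunc='mean')

# Positive gap: males earn more on average; negative gap: females earn more
mean_pay['Gap (Male - Female)'] = mean_pay['Male'] - mean_pay['Female']
print(mean_pay)
```

In this made-up example the gap is positive for one job and negative for the other, which is the kind of job-to-job reversal described above. Replacing `'Job Title'` with `'Department'` would give the corresponding department table.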
# Creates a scatter and line plot with 4 subplots showing the difference in pay statistics
# between gender by performance evaluation score
fig, axes = plt.subplots(1, 4, figsize=(30, 10))
fig.suptitle('Pay Statistics by Gender and Performance Evaluation Score', fontsize=20)
fig.supxlabel('Performance Evaluation Score', fontsize=14)
# Creates base pay and gender subplot
sns.lineplot(ax=axes[0], x='Performance Evaluation', y='Base Pay', hue='Gender', data=pay, palette='PiYG')
sns.scatterplot(ax=axes[0], x='Performance Evaluation', y='Base Pay', hue='Gender',
data=pay, palette='PiYG', alpha=0.7)
axes[0].set_title('Performance Score versus Base Pay and Gender', fontsize=16)
axes[0].set_xlabel('', fontsize=14)
axes[0].set_ylabel('Base Pay ($)', fontsize=14)
sns.move_legend(axes[0], 'upper right', bbox_to_anchor=(4.5, 1.25), frameon=False, fontsize=16, title='')
# Creates bonus and gender subplot
sns.lineplot(ax=axes[1], x='Performance Evaluation', y='Bonus', hue='Gender', data=pay, palette='PiYG')
sns.scatterplot(ax=axes[1], x='Performance Evaluation', y='Bonus', hue='Gender',
data=pay, palette='PiYG', alpha=0.7)
axes[1].set_title('Performance Score versus Bonus and Gender', fontsize=16)
axes[1].set_xlabel('', fontsize=14)
axes[1].set_ylabel('Bonus ($)', fontsize=14)
axes[1].get_legend().remove()
# Creates total pay and gender subplot
sns.lineplot(ax=axes[2], x='Performance Evaluation', y='Total Pay', hue='Gender', data=pay, palette='PiYG')
sns.scatterplot(ax=axes[2], x='Performance Evaluation', y='Total Pay', hue='Gender',
data=pay, palette='PiYG', alpha=0.7)
axes[2].set_title('Performance Score versus Total Pay and Gender', fontsize=16)
axes[2].set_xlabel('', fontsize=14)
axes[2].set_ylabel('Total Pay ($)', fontsize=14)
axes[2].get_legend().remove()
# Creates percent bonus and gender subplot
sns.lineplot(ax=axes[3], x='Performance Evaluation', y='Percent Bonus', hue='Gender', data=pay, palette='PiYG')
sns.scatterplot(ax=axes[3], x='Performance Evaluation', y='Percent Bonus', hue='Gender',
data=pay, palette='PiYG', alpha=0.7)
axes[3].set_title('Performance Score versus Percent Bonus and Gender', fontsize=16)
axes[3].set_xlabel('', fontsize=14)
axes[3].set_ylabel('Percent Bonus (%)', fontsize=14)
axes[3].get_legend().remove()
# Creates a strip and box plot with 4 subplots showing the difference in pay statistics
# between gender by performance evaluation score
fig, axes = plt.subplots(1, 4, figsize=(30, 10))
fig.suptitle('Pay Statistics by Gender and Performance Evaluation Score', fontsize=20)
fig.supxlabel('Performance Evaluation Score', fontsize=14)
# Creates base pay and gender subplot
sns.stripplot(ax=axes[0], x='Performance Evaluation', y='Base Pay', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7)
sns.boxplot(ax=axes[0], x='Performance Evaluation', y='Base Pay', hue='Gender',
data=pay, palette='PiYG', showfliers=False)
axes[0].set_title('Performance Score versus Base Pay and Gender', fontsize=16)
axes[0].set_xlabel('', fontsize=14)
axes[0].set_ylabel('Base Pay ($)', fontsize=14)
sns.move_legend(axes[0], 'upper right', bbox_to_anchor=(4.5, 1.25), frameon=False, fontsize=16, title='')
# Creates bonus and gender subplot
sns.stripplot(ax=axes[1], x='Performance Evaluation', y='Bonus', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7)
sns.boxplot(ax=axes[1], x='Performance Evaluation', y='Bonus', hue='Gender',
data=pay, palette='PiYG', showfliers=False)
axes[1].set_title('Performance Score versus Bonus and Gender', fontsize=16)
axes[1].set_xlabel('', fontsize=14)
axes[1].set_ylabel('Bonus ($)', fontsize=14)
axes[1].get_legend().remove()
# Creates total pay and gender subplot
sns.stripplot(ax=axes[2], x='Performance Evaluation', y='Total Pay', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7)
sns.boxplot(ax=axes[2], x='Performance Evaluation', y='Total Pay', hue='Gender',
data=pay, palette='PiYG', showfliers=False)
axes[2].set_title('Performance Score versus Total Pay and Gender', fontsize=16)
axes[2].set_xlabel('', fontsize=14)
axes[2].set_ylabel('Total Pay ($)', fontsize=14)
axes[2].get_legend().remove()
# Creates percent bonus and gender subplot
sns.stripplot(ax=axes[3], x='Performance Evaluation', y='Percent Bonus', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7)
sns.boxplot(ax=axes[3], x='Performance Evaluation', y='Percent Bonus', hue='Gender',
data=pay, palette='PiYG', showfliers=False)
axes[3].set_title('Performance Score versus Percent Bonus and Gender', fontsize=16)
axes[3].set_xlabel('', fontsize=14)
axes[3].set_ylabel('Percent Bonus (%)', fontsize=14)
axes[3].get_legend().remove()
# Creates a swarm and violin plot with 4 subplots showing the difference in pay statistics
# between gender by performance evaluation score
fig, axes = plt.subplots(1, 4, figsize=(30, 10))
fig.suptitle('Pay Statistics by Gender and Performance Evaluation Score', fontsize=20)
fig.supxlabel('Performance Evaluation Score', fontsize=14)
# Creates base pay and gender subplot
sns.swarmplot(ax=axes[0], x='Performance Evaluation', y='Base Pay', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7, s=2.25)
sns.violinplot(ax=axes[0], x='Performance Evaluation', y='Base Pay', hue='Gender',
data=pay, palette='PiYG', inner=None)
axes[0].set_title('Performance Score versus Base Pay and Gender', fontsize=16)
axes[0].set_xlabel('', fontsize=14)
axes[0].set_ylabel('Base Pay ($)', fontsize=14)
sns.move_legend(axes[0], 'upper right', bbox_to_anchor=(4.5, 1.25), frameon=False, fontsize=16, title='')
# Creates bonus and gender subplot
sns.swarmplot(ax=axes[1], x='Performance Evaluation', y='Bonus', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7, s=2.25)
sns.violinplot(ax=axes[1], x='Performance Evaluation', y='Bonus', hue='Gender',
data=pay, palette='PiYG', inner=None)
axes[1].set_title('Performance Score versus Bonus and Gender', fontsize=16)
axes[1].set_xlabel('', fontsize=14)
axes[1].set_ylabel('Bonus ($)', fontsize=14)
axes[1].get_legend().remove()
# Creates total pay and gender subplot
sns.swarmplot(ax=axes[2], x='Performance Evaluation', y='Total Pay', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7, s=2.25)
sns.violinplot(ax=axes[2], x='Performance Evaluation', y='Total Pay', hue='Gender',
data=pay, palette='PiYG', inner=None)
axes[2].set_title('Performance Score versus Total Pay and Gender', fontsize=16)
axes[2].set_xlabel('', fontsize=14)
axes[2].set_ylabel('Total Pay ($)', fontsize=14)
axes[2].get_legend().remove()
# Creates percent bonus and gender subplot
sns.swarmplot(ax=axes[3], x='Performance Evaluation', y='Percent Bonus', hue='Gender',
data=pay, color='k', dodge=True, alpha=0.7, s=2.25)
sns.violinplot(ax=axes[3], x='Performance Evaluation', y='Percent Bonus', hue='Gender',
data=pay, palette='PiYG', inner=None)
axes[3].set_title('Performance Score versus Percent Bonus and Gender', fontsize=16)
axes[3].set_xlabel('', fontsize=14)
axes[3].set_ylabel('Percent Bonus (%)', fontsize=14)
axes[3].get_legend().remove()
$\;\;\;\;\;\;$Several types of graphs were created to analyze how a variety of pay statistics differ by performance evaluation score and gender. The pay statistics analyzed are base pay, bonus, total pay, and percent bonus. The graph types are line plots, scatter plots, box plots, strip plots, violin plots, and swarm plots. From these graphs, it can be seen that performance evaluation scores play a large role in the allocation of bonuses and in the percent bonus an employee receives, but play very little role in an employee's base pay and total pay. There is a clear positive trend when performance score is compared to bonus and to percent bonus. In contrast, there is no visible trend between performance score and base pay or between performance score and total pay, since these graphs are close to horizontal. This is unexpected: a positive trend would have been predicted for all four pay statistics. Furthermore, these graphs show that the gender pay gap does not differ between performance scores. No matter the performance score, the data follow the general trends initially discovered: males have higher base pay and total pay, while females have higher bonuses and percent bonuses. This is as expected and unsurprising. In conclusion, performance evaluation scores affect bonuses and percent bonuses but not base pay and total pay, and the gender pay gap follows the expected trend at every performance score.
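The difference between a "clear positive trend" and a "close to horizontal" trend can be quantified with a correlation coefficient. The following is a minimal sketch under stated assumptions: `pay_demo` and its rows are made up for illustration (they are not values from the data set), and the Pearson correlation is used only as one possible summary, not as the method of this analysis.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT values from the Glassdoor data.
pay_demo = pd.DataFrame({
    'Performance Evaluation': [1, 2, 3, 4, 5],
    'Base Pay': [90000, 95000, 85000, 95000, 90000],
    'Bonus': [2000, 4000, 6000, 8000, 10000],
})

# Pearson correlation of each pay statistic with the performance score
corr = pay_demo[['Base Pay', 'Bonus']].corrwith(pay_demo['Performance Evaluation'])
print(corr.round(2))
```

A coefficient near 1 corresponds to the rising bonus line, while a coefficient near 0 corresponds to the flat base pay line. The same call could be applied to the real `pay` data frame, assuming its performance evaluation column is numeric.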
# Creates a scatter and line plot with 4 subplots showing the difference in pay statistics by seniority
fig, axes = plt.subplots(1, 4, figsize=(30, 10))
fig.suptitle('Pay Statistics by Seniority', fontsize=20)
fig.supxlabel('Seniority', fontsize=14)
# Creates base pay subplot
sns.lineplot(ax=axes[0], x='Seniority', y='Base Pay', data=pay, color='g')
sns.scatterplot(ax=axes[0], x='Seniority', y='Base Pay', data=pay, color='k', alpha=0.7)
axes[0].set_title('Seniority versus Base Pay', fontsize=16)
axes[0].set_xlabel('', fontsize=14)
axes[0].set_ylabel('Base Pay ($)', fontsize=14)
# Creates bonus subplot
sns.lineplot(ax=axes[1], x='Seniority', y='Bonus', data=pay, color='g')
sns.scatterplot(ax=axes[1], x='Seniority', y='Bonus', data=pay, color='k', alpha=0.7)
axes[1].set_title('Seniority versus Bonus', fontsize=16)
axes[1].set_xlabel('', fontsize=14)
axes[1].set_ylabel('Bonus ($)', fontsize=14)
# Creates total pay subplot
sns.lineplot(ax=axes[2], x='Seniority', y='Total Pay', data=pay, color='g')
sns.scatterplot(ax=axes[2], x='Seniority', y='Total Pay', data=pay, color='k', alpha=0.7)
axes[2].set_title('Seniority versus Total Pay', fontsize=16)
axes[2].set_xlabel('', fontsize=14)
axes[2].set_ylabel('Total Pay ($)', fontsize=14)
# Creates percent bonus subplot
sns.lineplot(ax=axes[3], x='Seniority', y='Percent Bonus', data=pay, color='g')
sns.scatterplot(ax=axes[3], x='Seniority', y='Percent Bonus', data=pay, color='k', alpha=0.7)
axes[3].set_title('Seniority versus Percent Bonus', fontsize=16)
axes[3].set_xlabel('', fontsize=14)
axes[3].set_ylabel('Percent Bonus (%)', fontsize=14)
# Creates a strip and box plot with 4 subplots showing the difference in pay statistics by seniority
fig, axes = plt.subplots(1, 4, figsize=(30, 10))
fig.suptitle('Pay Statistics by Seniority', fontsize=20)
fig.supxlabel('Seniority', fontsize=14)
# Creates base pay subplot
sns.stripplot(ax=axes[0], x='Seniority', y='Base Pay', data=pay, color='k', alpha=0.7)
sns.boxplot(ax=axes[0], x='Seniority', y='Base Pay', data=pay, palette='PiYG', showfliers=False)
axes[0].set_title('Seniority versus Base Pay', fontsize=16)
axes[0].set_xlabel('', fontsize=14)
axes[0].set_ylabel('Base Pay ($)', fontsize=14)
# Creates bonus subplot
sns.stripplot(ax=axes[1], x='Seniority', y='Bonus', data=pay, color='k', alpha=0.7)
sns.boxplot(ax=axes[1], x='Seniority', y='Bonus', data=pay, palette='PiYG', showfliers=False)
axes[1].set_title('Seniority versus Bonus', fontsize=16)
axes[1].set_xlabel('', fontsize=14)
axes[1].set_ylabel('Bonus ($)', fontsize=14)
# Creates total pay subplot
sns.stripplot(ax=axes[2], x='Seniority', y='Total Pay', data=pay, color='k', alpha=0.7)
sns.boxplot(ax=axes[2], x='Seniority', y='Total Pay', data=pay, palette='PiYG', showfliers=False)
axes[2].set_title('Seniority versus Total Pay', fontsize=16)
axes[2].set_xlabel('', fontsize=14)
axes[2].set_ylabel('Total Pay ($)', fontsize=14)
# Creates percent bonus subplot
sns.stripplot(ax=axes[3], x='Seniority', y='Percent Bonus', data=pay, color='k', alpha=0.7)
sns.boxplot(ax=axes[3], x='Seniority', y='Percent Bonus', data=pay, palette='PiYG', showfliers=False)
axes[3].set_title('Seniority versus Percent Bonus', fontsize=16)
axes[3].set_xlabel('', fontsize=14)
axes[3].set_ylabel('Percent Bonus (%)', fontsize=14)
# Creates a swarm and violin plot with 4 subplots showing the difference in pay statistics by seniority
fig, axes = plt.subplots(1, 4, figsize=(30, 10))
fig.suptitle('Pay Statistics by Seniority', fontsize=20)
fig.supxlabel('Seniority', fontsize=14)
# Creates base pay subplot
sns.swarmplot(ax=axes[0], x='Seniority', y='Base Pay', data=pay, color='k', alpha=0.7, s=2.25)
sns.violinplot(ax=axes[0], x='Seniority', y='Base Pay', data=pay, palette='PiYG', inner=None)
axes[0].set_title('Seniority versus Base Pay', fontsize=16)
axes[0].set_xlabel('', fontsize=14)
axes[0].set_ylabel('Base Pay ($)', fontsize=14)
# Creates bonus subplot
sns.swarmplot(ax=axes[1], x='Seniority', y='Bonus', data=pay, color='k', alpha=0.7, s=2.25)
sns.violinplot(ax=axes[1], x='Seniority', y='Bonus', data=pay, palette='PiYG', inner=None)
axes[1].set_title('Seniority versus Bonus', fontsize=16)
axes[1].set_xlabel('', fontsize=14)
axes[1].set_ylabel('Bonus ($)', fontsize=14)
# Creates total pay subplot
sns.swarmplot(ax=axes[2], x='Seniority', y='Total Pay', data=pay, color='k', alpha=0.7, s=2.25)
sns.violinplot(ax=axes[2], x='Seniority', y='Total Pay', data=pay, palette='PiYG', inner=None)
axes[2].set_title('Seniority versus Total Pay', fontsize=16)
axes[2].set_xlabel('', fontsize=14)
axes[2].set_ylabel('Total Pay ($)', fontsize=14)
# Creates percent bonus subplot
sns.swarmplot(ax=axes[3], x='Seniority', y='Percent Bonus', data=pay, color='k', alpha=0.7, s=2.25)
sns.violinplot(ax=axes[3], x='Seniority', y='Percent Bonus', data=pay, palette='PiYG', inner=None)
axes[3].set_title('Seniority versus Percent Bonus', fontsize=16)
axes[3].set_xlabel('', fontsize=14)
axes[3].set_ylabel('Percent Bonus (%)', fontsize=14)
$\;\;\;\;\;\;$Several types of graphs were created to analyze the impact that seniority has on the pay statistics of employees. The pay statistics analyzed are base pay, bonus, total pay, and percent bonus. The graph types are line plots, scatter plots, box plots, strip plots, violin plots, and swarm plots. Seniority has a positive correlation with base pay, bonus, and total pay, while it has a negative correlation with percent bonus. The negative correlation arises because base pay and total pay rise steeply with seniority while bonus rises only gently, so the bonus shrinks relative to the rest of an employee's pay. This is an interesting result: the positive correlations between seniority and base pay, bonus, and total pay were entirely expected, but the negative correlation between seniority and percent bonus was not. In conclusion, as one becomes a more senior employee, their base pay, bonus, and total pay increase; however, their percent bonus decreases.
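The arithmetic behind a falling percent bonus can be shown with two made-up employees. This is a sketch under a stated assumption: it takes percent bonus to be bonus divided by base pay, times 100, which may differ from how the `Percent Bonus` column was computed earlier in this analysis. The helper `percent_bonus` and all pay figures are hypothetical.

```python
# Hypothetical helper; ASSUMES Percent Bonus = Bonus / Base Pay * 100,
# which may differ from how the column was computed earlier in this analysis.
def percent_bonus(bonus, base_pay):
    return bonus / base_pay * 100

# Made-up pay figures for a junior and a senior employee (not from the data set)
junior = {'Base Pay': 80000, 'Bonus': 6000}
senior = {'Base Pay': 120000, 'Bonus': 7200}

junior_pct = percent_bonus(junior['Bonus'], junior['Base Pay'])
senior_pct = percent_bonus(senior['Bonus'], senior['Base Pay'])

print(f'Junior: {junior_pct:.1f}%')
print(f'Senior: {senior_pct:.1f}%')
```

Here the senior employee's base pay is 50% higher and their bonus is 20% higher, so both values rise with seniority, yet the percent bonus falls from 7.5% to 6.0%.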
# Creates a scatter and line plot showing the difference in seniority by education level
sns.set(rc = {'figure.figsize':(12,10)})
sns.lineplot(x='Education', y='Seniority', data=pay_my_type, color='g')
sns.scatterplot(x='Education', y='Seniority', data=pay_my_type, color='k', alpha=0.7)
plt.title('Seniority versus Education Level', fontsize=16)
plt.xlabel('Education', fontsize=14)
plt.ylabel('Seniority', fontsize=14)
# Creates a strip and box plot showing the difference in seniority by education level
sns.set(rc = {'figure.figsize':(12,10)})
sns.boxplot(x='Education', y='Seniority', data=pay_my_type, palette='PiYG', showfliers=False)
sns.stripplot(x='Education', y='Seniority', data=pay_my_type, color='k', alpha=0.7)
plt.title('Seniority versus Education Level', fontsize=16)
plt.xlabel('Education', fontsize=14)
plt.ylabel('Seniority', fontsize=14)
# Creates a swarm and violin plot showing the difference in seniority by education level
sns.set(rc = {'figure.figsize':(12,10)})
sns.violinplot(x='Education', y='Seniority', data=pay_my_type, palette='PiYG', inner=None)
sns.swarmplot(x='Education', y='Seniority', data=pay_my_type, color='k', alpha=0.7, s=2.25)
plt.title('Seniority versus Education Level', fontsize=16)
plt.xlabel('Education', fontsize=14)
plt.ylabel('Seniority', fontsize=14)
$\;\;\;\;\;\;$Several types of graphs were created to analyze the impact that education level has on the likelihood that an employee becomes a senior employee. The graph types are line plots, scatter plots, box plots, strip plots, violin plots, and swarm plots. All of these graphs show a nearly horizontal trend. This means that the education level of an employee does not appear to affect their likelihood of becoming a senior employee. This is completely unexpected, as it would have been thought that the company would have tried its best to retain those with higher education, resulting in those employees being more likely to become senior employees. However, this is not the case. In conclusion, these surprising results suggest that education level is not important when it comes to a company's attempts at retaining its employees.
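One way to check a "nearly horizontal" result numerically is to compare the share of employees at each seniority level across education levels. The following is a minimal sketch, not the method used in this analysis: `pay_demo` and its rows are made up, only two education levels are shown, and seniority is assumed to be stored as a number.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT values from the Glassdoor data.
# ASSUMPTION: seniority is stored as a number, with larger values being more senior.
pay_demo = pd.DataFrame({
    'Education': ['High School'] * 4 + ['PhD'] * 4,
    'Seniority': [1, 3, 5, 5, 1, 3, 5, 5],
})

# Each row sums to 1: the share of that education level found at each seniority level
share = pd.crosstab(pay_demo['Education'], pay_demo['Seniority'], normalize='index')
print(share)
```

If education did not affect seniority, the rows of this table would be close to identical, as they are in this made-up example. Rows that differ noticeably would instead point to a relationship between the two columns.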
# Creates a scatter and line plot showing the difference in performance evaluation score by seniority
sns.set(rc = {'figure.figsize':(12,10)})
sns.lineplot(x='Seniority', y='Performance Evaluation', data=pay, color='g')
sns.scatterplot(x='Seniority', y='Performance Evaluation', data=pay, color='k', alpha=0.7)
plt.title('Seniority versus Performance Evaluation Score', fontsize=16)
plt.xlabel('Seniority', fontsize=14)
plt.ylabel('Performance Evaluation Score', fontsize=14)
# Creates a strip and box plot showing the difference in performance evaluation score by seniority
sns.set(rc = {'figure.figsize':(12,10)})
sns.boxplot(x='Seniority', y='Performance Evaluation', data=pay, palette='PiYG', showfliers=False)
sns.stripplot(x='Seniority', y='Performance Evaluation', data=pay, color='k', alpha=0.7)
plt.title('Seniority versus Performance Evaluation Score', fontsize=16)
plt.xlabel('Seniority', fontsize=14)
plt.ylabel('Performance Evaluation Score', fontsize=14)
# Creates a swarm and violin plot showing the difference in performance evaluation score by seniority
sns.set(rc = {'figure.figsize':(12,10)})
sns.violinplot(x='Seniority', y='Performance Evaluation', data=pay, palette='PiYG', inner=None)
sns.swarmplot(x='Seniority', y='Performance Evaluation', data=pay, color='k', alpha=0.7, s=2.25)
plt.title('Seniority versus Performance Evaluation Score', fontsize=16)
plt.xlabel('Seniority', fontsize=14)
plt.ylabel('Performance Evaluation Score', fontsize=14)
$\;\;\;\;\;\;$Several types of graphs were created to analyze the impact that seniority has on the performance evaluation score that an employee obtains. The graph types are line plots, scatter plots, box plots, strip plots, violin plots, and swarm plots. All of these graphs show a nearly horizontal trend. This means that seniority does not affect the performance evaluation score that an employee obtains. Therefore, there does not appear to be bias by the performance evaluator toward senior employees; they can receive any score, just like the rest of the employees. This is expected: the likelihood of substantial bias was minimal, but it was still determined to be worth testing. This horizontal trend also shows that those with higher performance evaluations do not have a higher likelihood of becoming senior employees. The company does not appear to prioritize those with high performance evaluation scores over those with lower ones, since employee retention appears to be about the same for all performance scores. This is unexpected, as it was thought that the company would have tried harder to retain employees with higher performance scores, but this does not appear to be the case. It appears that the company tries to retain all employees equally. In conclusion, this company does not have a bias towards senior employees, nor does it try harder to retain employees with high performance evaluation scores over those with low performance evaluation scores.
$\;\;\;\;\;\;$To conclude, many correlations have been found in this data set of one thousand employees from Glassdoor. The data set has shown that while males have a higher base pay and total pay, females have a higher bonus and percent bonus. However, these statistics vary greatly across different jobs and slightly across different departments. Furthermore, while those with higher performance scores receive higher bonuses and percent bonuses, they do not receive higher base pay and total pay. The gender pay gap also does not appear to differ across performance scores, and there does not appear to be a bias against any gender when determining performance scores. Next, the data set showed that while seniority influences base pay, bonus, and total pay positively, it influences percent bonus negatively. Lastly, the data set showed that the education level and the performance evaluation score of an employee do not affect their likelihood of becoming a senior employee. In conclusion, this data set has resulted in many interesting discoveries regarding employee statistics. However, while this analysis has attempted to be as complete as possible, it did not explore many of the other values included in the data set. For example, the age column could have shown some interesting correlations with many other columns, but it was not explored at all. If this project were to be done again in the future, the age column would most definitely be explored further. Be that as it may, this project was challenging as is; creating some of the graphs and subplots was very difficult, so there was not enough time in this instance to explore the age column thoroughly.